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Abstract 

Background: The center string (or closest string) problem is a classic computer science problem with important 
applications in computational biology. Given k input strings and a distance threshold d, we search for a string 
within Hamming distance at most d to each input string. This problem is NP complete. 

Results: In this paper, we focus on exact methods for the problem that are also swift in application. We first 
introduce data reduction techniques that allow us to infer that certain instances have no solution, or that a center 
string must satisfy certain conditions. We describe how to use this information to speed up two previously 
published search tree algorithms. Then, we describe a novel iterative search strategy that is effecient in practice, 
where some of our reduction techniques can also be applied. Finally, we present results of an evaluation study for 
two different data sets from a biological application. 

Conclusions: We find that the running time for computing the optimal center string is dominated by the 
subroutine calls for d = d opt -1 and d = d opt . Our data reduction is very effective for both, either rejecting 
unsolvable instances or solving trivial positions. We find that this speeds up computations considerably. 



Background 

The CENTER STRING problem (also known as CLO- 
SEST STRING problem) is defined as follows: given k 
strings of length L over an alphabet S and a distance 
threshold d, find a string of length L that has Hamming 
distance at most d to each of the given strings. 

The CENTER STRING problem has been studied 
extensively in theoretical computer science and, particu- 
larly, in computational biology [1,2], and has various 
applications such as degenerate PCR primer design [3] 
or motif finding [1,4]. We are particularly interested in 
its application as part of finding approximate gene clus- 
ters. The increasing speed of genome sequencing and 
the resulting increase in the number of available data 
sets offers the possibility of comparing the gene order of 
whole genomes. During the course of evolution, specia- 
tion results in the divergence of genomes that initially 
have the same gene order and content. Conserved gene 
order is evidence of a particular biological signal [5]. 
Approximate gene cluster models account for reordering 
inside the gene cluster, as well as additional and missing 
genes in the genomes compared [6,7]. The center gene 
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cluster model limits the distance between the gene clus- 
ter and each of the approximate occurrences. For given 
approximate occurrences, finding the center gene cluster 
is equivalent to finding a center string for binary input 
strings. 

Previous work 

The CENTER STRING problem is NP complete [1,8], 
hence no polynomial time algorithm can exist unless P 
= NP. Different approaches have been studied for the 
problem. Ma and Sun [9] presented a polynomial time 
approximation scheme with time complexity 0(Lfe°^~ 2 ^) 
for an approximation ratio of 1 + s for any s > 0. In 
addition, heuristics and parallel implementations with 
good practical running times have been developed 
[10,11]. The drawback of these approaches is that they 
cannot guarantee that an exact solution will be found. 

In parameterized algorithmics, we use a parameter to 
describe the complexity of a problem instance. We 
restrict the super-polynomial running time of an algo- 
rithm using this parameter while at the same time still 
guaranteeing that optimal solutions are found. Formally, 
a problem with input size n and parameter k is fixed- 
parameter tractable if it can be solved in 0(f(k) • pin)) 
time, where /is an arbitrary function and p is a polyno- 
mial. Parameters that have been studied in the literature 
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for the CENTER STRING problem are the distance 
threshold d and the number of input strings k. For the 
latter parameter, Gramm et al. [12] showed that the 
problem is fixed-parameter tractable using an Integer 
Linear Program. Evaluations indicate that this approach 
is of theoretical interest only and impractical for k > 5. 
Regarding the distance threshold d, in the same paper 
an algorithm was given with running time 0(kL + kd d 
Later, Ma and Sun [9] presented an algorithm with 
running time 0(kL + kd • l6 d (\L\ - l) d ). Recently, Wang 
and Zhu [2] further improved running times to 0(kL + 
kd • 9.53*(|E| - l) d ), and Chen et al. [13] to 0(kL + 
kd 2 6.14> d ) for binary strings. These algorithms are based 
on the search tree paradigm. Note that for binary 
strings, the term - l) d vanishes. 

Besides these fixed-parameter approaches, Meneses et 
al. [14] proposed a heuristic to compute upper and 
lower bounds using a branch-and-bound algorithm and, 
very recently, Kelsey and Kotthoff [15] investigated a 
constraint programming approach. 

All of the above results, as well as our results pre- 
sented below, deal with the CENTER STRING problem 
under the Hamming distance. Nicolas and Rivals [16] 
showed that the CENTER STRING problem under the 
Levenshtein distance is NP-hard and W[l]-hard regard- 
ing the number of input strings, even for binary strings. 
On this account, no FPT algorithm with parameter k 
can exist unless FPT = W[l]. Furthermore, the authors 
generalized these results to any weighted edit distance 
satisfying a certain natural condition, namely, a slightly 
tightened triangle inequality (see Property 1 in [16] for 
details). Note, that CENTER STRING is polynomial if 
the number of input strings and the weighted edit dis- 
tance are fixed [16]. 

Our contribution 

In this paper, we focus on exact methods that are also 
swift in application. We have developed an advanced 
preprocessing to filter out unsolvable instances quickly. 
Additionally, we compute rules that can be used within 
search tree algorithms to bound the search space, 
excluding unsolvable instances. We show how to inte- 
grate this information into the algorithms from [9,12]. 
We then present a new iterative search strategy called 
MismatchCount, which, despite its bad worst-case run- 
ning time, works well in practice. We implemented all 
three algorithms to evaluate their performance in com- 
bination with our preprocessing. We present results of 
our experimental evaluation, showing that preprocessing 
and the novel algorithm improve running times by sev- 
eral orders of magnitude. We find that, in particular, the 
cases d = d opt - 1 and d = d opt are notoriously difficult 
for all approaches, where d opt is the smallest distance 
value for which a solution exists. 



A preliminary version of this paper has been published 
in Proc. of Workshop on Algorithms in Bioinformatics, 
WABI 2010, Volume 6293 of Lect. Notes Comput. Sc., 
Springer 2010:325-336. 

Methods 

Preliminaries 

For a string s over a finite alphabet E, let s[i] be the ith 
character of s and s[i, j] the substring of s starting at 
position i and ending at position j. The length of 5 is 
denoted by \s\. 

The Hamming distance d H (s, t) of two strings s and t 
of the same length L is the number of positions p with s 
[p] * t[p]. Let R = p m } <= {1,..., L} be a set of posi- 

tions such that pi <Pi+\ for all 1 < i <m. Then s\ R := s 
\Pi] s[Pm\ denotes the subsequence of s restricted to 
the positions in R. We define the Hamming distance of 
two strings s and t restricted to R as 
d^(s, t) := d H {s\ R ,t\ R ). For two strings s and t f let D Sf t := 
{p : s[p] * t[p]} <= {1,..., L} be the set of positions where 
s and t differ, and let E s> t := {p : s[p] = t[p]} = {1,..., L} 
\D S} t be the set of positions where s and t are identical. 
Note that d° st (s,t) = \D S/t \ = d H (s,t} For k input strings 
s lf ..., s k , we write A,j •= D SiiSj and E^j := E SiiSj , As noted 
in the introduction, we will often limit ourselves to a 
binary alphabet S = {0, 1}, here, we define 
sjp] = l- s[p\ 

The CENTER STRING problem is defined as follows: 
for strings s lf ..., s k of length L over an alphabet E, and a 
distance threshold d, find a string s of length L, called 
center string, which has Hamming distances at most d 
to each of the given strings. 

We note that permuting positions of all strings by the 
same permutation, results in an equivalent instance. Let 
ji be a permutation over positions 1,..., L. For a string s 
= 5(1) ... s{L) of length I, let n(s) := s(n(l)) s(n(2)) ... s 
(tt(L)) be the permuted string. Let s lf ..., s k be an instance 
of CENTER STRING problem, in which all strings have 
length L. Then, a given string 5 has Hamming distance 
at most d to all strings s lf ..., s k , if and only if n(s) has 
Hamming distance at most d to all strings nis^,..., n{s k ). 
For k strings s^..., s k and distance threshold d, we can 
construct a naive kernel as follows [12]: a position p is 
called clean if all sequences coincide at this position, i.e. 
s i[f] = s j[f] f° r au 1 < i <j < k. If a position is not clean, 
we call it dirty. One can easily see that there can be at 
most kd dirty positions if an instance with k strings 
allows for a center string with distance d. If a position is 
not dirty, then all strings share the same character at 
this position, and the center string will also share this 
character. We can thus remove all clean positions and 
obtain an instance of length L < kd. Now let us assume 
that d is given to us as a parameter. Again, we remove 
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all clean positions from the instance. If the resulting 
strings have more than kd characters, the instance can 
be rejected. Similarly, we can reject an instance that 
contains a string pair with distance larger than 2d, since 
the Hamming distance is a metric and satisfies the trian- 
gle inequality. In our algorithms, we assume a distance 
threshold d to be given. In applications, we might not 
know the distance threshold d in advance but instead 
search for a center string minimizing d. We can do so 
by calling our algorithms repeatedly, increasing d = 0, 1, 
2,... until a solution is found for d = d opt . Both in theory 
and in our experimental evaluation, we find that the 
running time of this iteration is governed by the last 
subroutine calls with d = d opt - 1 and d = d opt . To this 
end, we will put special focus on these two cases in our 
evaluations. 

Our proposed data reduction often allows us to infer 
that no solution can exist for a particular distance 
threshold d. However, where we cannot rule out the 
existence of a center string by data reduction (what is 
obviously the case when d = d opt ), we still have to 
decide whether a valid center string exists. All algo- 
rithms for doing so, such as those presented in [2,9,12] 
and the MismatchCount algorithm presented in this 
paper, scan through all 2 L possible binary strings and 
test whether any such string is a center string of the 
input. The algorithms differ in the order in which they 
process the 2 L strings and, in particular, how they con- 
strain the search space to speed up computations. 

Data reduction 

Our data reduction is based on the pairwise comparison 
of the input strings. Given an instance s lf ..., s k and d of 
the CENTER STRING problem, we can divide all pairs 
of strings {s if Sj} into three groups: pairs with distance 
less than 2d - 1, greater than 2d, or equal to 2d or 2d - 
1. If two strings s if Sj with Hamming distance d H (s i} sj) > 
2d exist, then the instance has no solution. A center 
string s can have at most distance d to each of s t and Sj 
and, hence, d H (s it sj) < d H (s if s) + d H (s, sj) < 2d. There- 
fore, d > ^-maxjj d\\{si,Sj) must hold for the instance to 

have a solution. 
Solving trivial positions 

Some positions of the solution string can be trivially 
solved. This is based on the following observation: 

Lemma 1. Given strings s lf ..., s k and a center string s 
with distance d. For two strings s if Sj such that d n {s i} sj) 
= 2d or d n {s b sj) = 2d - 1, we have 

= Si[p] = Sj[p] for all p e Ei tj . 

Proof. A center string with distance at most d to all 
strings is located centrally between the two strings s t 



and Sj with distance 2d and therefore has distance d to 
both of them. Thus, all positions fixed between s t and Sj 
must also be fixed in s. We can extend our reasoning to 
string pairs with distance 2d - 1. We need to change d 
positions in at least one of the strings and E i} j is the set 
of equal positions between both strings, hence we are 
still not allowed to change any position p e Ey. 

As a reduction rule, if we find two strings s b Sj with 
d H (Si, sj) > 2d - 1, then we can set s[p] := s^p] for all p 
e E it j and mark these positions as "permanent". Let V 
denote this set of permanent positions. We can general- 
ize this rule to solve additional positions. Assume a spe- 
cific di := d for each input string s b which is increased 
by one for every solved position that does not match s t . 
If we find two strings s if Sj with d H (s i} sj) = d t + dp we 
can again set s[p] := s t \p\ for all p e E^ ; and mark these 
positions as "permanent". In the case d t = dj the rule 
remains the same as that given above. We repeat this 
rule until no fitting string pair s b Sj can be found. 

Applying this reduction rule, we may run into conflicts 
where we have to permanently set a certain position to 
'0' and T simultaneously. We infer that the instance has 
no solution for the current choice of d. If we do not 
have a conflict, then applying this data reduction results 
in a partially solved solution string s with s[p] = c e £ 
fixed for all p E V, whereas all positions not in V still 
have to be decided. 
Computation of position subsets 

We focus next on pairs of strings s u Sj with d n {s u sj) < 
2d - 1. For a given center string s we define 

X f/j (5) := {p e E itj : Si[p] = Sj[p] Js\p]} 

as the set of positions where s t and Sj agree, but dis- 
agree with the center string s. We extend the reasoning 
behind Lemma 1 as follows: 

Lemma 2. Given strings 5!,..., s k and a center string s 
with distance d. For two strings s u Sj such that d H (s i} sj) 
<2d - 1, we have 

\Xi,j(s)\ <d-^d H {s if Sj). 

Proof Set D := D if Regarding the distances between 
s\ D and Si\ D as well as Sj\ D , we can state that s\ D has a 

distance of at least ^d\\(si,Sj) to at least one of the 

strings sj\ D or s^: 

mzx{d H (si\ D , s\ D ),d H (sj\ D ,s\ D )} > ^d H (s if Sj). 

This is true since d H is a metric and the triangle 
inequality holds, d n {sj\ D ) < d H (Si\ D , s\ D ) + d H (sj\ D , s\ D ). 

Since we need a distance of at least —d\\{si,sf) to solve 
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the positions from D, a distance of at most 
d — ^d\\{si,Sj) remains to solve the positions from Eg. 

Lemma 2 implies that the maximum number of posi- 
tions p g E u j that we are allowed to choose in the cen- 
ter string with s[p] * Si\p\ is bounded by d — ^dn(si,Sj). 

We can transform this observation into a reduction rule 
as follows: when, during search tree traversal or by 
other reduction rules, we have a partially solved solution 
string 5 such that 

PQ #J -(s)| >d - ~dH(si,Sj) 

for any pair s if Sj, then we can infer that 5 cannot be 
extended to a solution for the current choice of d. For 

each pair s b Sp we therefore set := d — ^dn{si,Sj) and 

store all tuples (Eg, Xg) in an array 7~. 

Removing redundant information from T may lead to 
further trivially solved positions. This is done by remov- 
ing, for all 1 < i <j < k, all positions p e V PI Eyfrom Eg. 
Moreover, if s\p] * s^p] then we decrease Xg by one. 

For Xg = 0 we set all positions p from Eg to "perma- 
nent" and include them in p. Since V has changed, we 
continue our data reduction again until there is no tuple 
(Ey, Xij) with Xij = 0 in T > For xy < 0 we can easily 
infer that a conflict must exist and, as a result, the 
instance has no valid solution for this distance threshold 
d. 

Cascading 

To enlarge further the number of solved positions we 
consider all pairs of strings s if Sj with Xy = 1 and use 
cascading. A valid center string s has to agree with s t in 
at least \E i} j\ - 1 positions from E itP hence for binary 
strings, at most one position p e E it j can be set to 

s[p] = W\- 

To this end, we test for all positions p e Eg what we 
can infer from setting s[p] = si[p] This implies x it ; = 0, 
hence we add the remaining positions q e Eg, q * p, to 
V and reduce the tuple set T- If we run into a conflict 
during this reduction, we know that setting s[p] = S{[p] 
cannot result in a valid solution. In this case, we infer s 
lp] - s ilp] and permanently set position p. 

Unfortunately, if there is no conflict, setting s[p] = s t 
[p] is not mandatory. Nonetheless, we get a partially 
solved solution string s p>v and a set of "potentially per- 
manent" positions V PiV depending on the position p and 
the value y = S{ [p] ♦ We store this information in a set of 
rules 

We can use the set of rules 71 when solving the remain- 
ing instance, for example by means of a search tree 



algorithm. If, during the search tree traversal, we decide 
to set s[p] = v for the solution string s, then we can 
immediately start the above data reduction. For all posi- 
tions q e Pp, v \V, we set the solution string s[q] = s PtV [q]. 
For the remaining positions q e Vp iV Pi V the condition s 
[q] = s P}V [q] must be met, otherwise we run into a conflict 
and, thus, this branch of the search tree does not lead to 
a valid solution. 

Integration into search tree algorithms 

We can use the information derived during preproces- 
sing, stored in the sets *p, T> 7Z> to speed up the algo- 
rithms of Ma and Sun [9], and Gramm et al. [12]. 
Unfortunately, the use of V, T> TZ does not change the 
worst-case running times of both algorithms. But our 
preprocessing, as an algorithm engineering technique, 
allows us to speed up the algorithms in practice. 

The algorithm of Ma and Sun tackles the more gen- 
eral NEIGHBOR STRING problem. Given s lf s 2 ,..., s k of 
length L and non-negative integers d lf d^..., d k , find a 
string s of length L such that d(s, s t ) < d t for every 1 < i 
< k. The algorithm starts by testing whether 5! is already 
a valid solution. If not, there has to be at least one H 0 
with d\\{si, Si 0 ) > di 0 . For these two strings 5! and V we 
create the sets of equal positions E := E 1; j 0 and different 
positions D := Di ; j 0 , as well as the substrings Si\ D and 
Sjjo. Note that d^(si,Si 0 ) = du(si,Si 0 ). From these 
strings, one can infer that d^{s,Si)<di and 
dft(s,Si 0 ) < d{ 0 . To fit Si to the solution string s, it is 
necessary to change the positions in D without exceed- 
ing the limits d x and d{ 0 . Thus, for any string t of length 
\D\ we test whether t = s\ D is a possible solution. 
Hence, the width of the search tree is based on the 
number of strings t that fulfill the condition 

\t\ = \D\ and d H {t, si\ D ) < d Y and d H (t, s io \ D ) < d io . 

For these eligible strings t, we obtain a new branch of 
the search tree by creating a new NEIGHBOR STRING 
instance. The new distance thresholds e t depend on the 
distance of t to the substrings Si\ D , so e t := d t _ d H (t, Si\ 
D ). For ei we have the additional constraint 

d\ 



e\ := min \ d\ — du{t, s\ \d), 



1 



For further 



information about this approach, see [9]. 

The algorithm of Gramm et al. is a depth-bounded 
search tree that is initialized with any s e {si,..., s k }, 
which is adapted step by step to the solution. The first 
step is to find a string s t that differs from the candidate 
string s in more than d positions. If no such string 
exists, then s is a valid solution. Otherwise, we change s 
until a center string s is found or more than d positions 
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in s are changed. This results in a maximum tree height 
of d. From the set D SSi we choose d + 1 positions to 
branch, leading to a total tree size of (d + l) d = 0(d d ). 
Since d < \D S/Si \ < 2d and |D$ /S .| < d, at most d elements 
from D SSi do not converge to a solution. Therefore, 
choosing d + 1 elements from D S/Si produces at least one 
exact move. For a detailed description of the algorithm, 
see [12]. 

Integrating the set of solved positions V into the 
algorithm of Ma and Sun is straightforward, since we 
can delete all solved positions and decrease d t by one 
for every mismatch. 

For the algorithm of Gramm et al we cannot have 
different d b hence we have to test whether or not the 
position is permanent within the search tree. Assume s 
is the candidate string. For any position p from V we 
set s[p] := s[p]. During the algorithm, we ensure that 
none of these positions is changed. Let Q := {1, ...,L}\V 
be the set of positions that are not permanent. For each 
string Si we can estimate the distance threshold d t 
between s\\q and s\q as described for NEIGHBOR 
STRING instances. Instead of choosing d + 1 positions 
to branch, we now have to choose only d t + 1 positions 
from D S/Si \V. Given that, for all positions in V, the can- 
didate string was set to the value of the solution string, 
there are no positions pePn D SSi with Si[p] = s[p], and 
hence \V fl D S/Si \=d- d { . Since \D SiSi \V\ <d + d { and 
\D$ iSi < d\ at most d - d + d t positions do not a converge 
to a solution. Therefore, among the d t + 1 possible 
modifications from D SiSi \V, there is at least one that 
brings us towards a solution. 

To integrate T and 7Z, we exclude branches which 
cannot produce a valid solution. Branches are pruned by 
simply testing whether the (partial) string candidate of 
the search tree conflicts with the information. For a 
position p from a particular NEIGHBOR STRING 
instance we use m(p) to denote the corresponding posi- 
tion in the original instance. 

In the algorithm of Ma and Sun, when creating all 
strings t of length \D\, we test for their consistency with 
the rules from 7Z. Assume t - pi ... p\ with p t e D, < 1 i 
< 1. For all pi £ D, 1 < i < I we check whether there is a 
rule (Smfait^Vmfaitii]) e 71 and test if the remaining 
positions in t are consistent with the partially solved 
solution string. If that is not the case, the current t will 
not lead to a valid solution. There is even more infor- 
mation in 7Z that we can use. If we find a t that is con- 
sistent with 7Z, we use the solved positions from all sets 
Pmfa) f t[i> with 1 < / < \t\, to reduce the NEIGHBOR 
STRING instance for the recursion step. For that reason 
we build an overlay of all 5m(p;)4*] \e w ^ Pi G D to £ et a 
new set of solved positions. Furthermore, we can check 
the consistency of t with 7~. For all (-Etj/^tj) £ T we test 



whether t has more inconsistent positions than are 
allowed. Assume t = p x ... Pi- We count all positions p n 
with m(p n ) e E u j and Si[m{p n )] * t[n\. If there are more 
than Xij of these positions, the current t is not consis- 
tent with 7" and hence cannot produce a valid solution. 

In the algorithm of Gramm et al., we can restrict the 
positions we can choose to branch. Assume s k is the 
string with d H (s, s k ) >d. We can only branch over a 
position p if we checked the following condition for all 
(Ei^xij) e T containing p: if s[p] = s t [p] * s k [p] we 
would have to change s[p] to s k [p], thus we would set 
s[p] = Si[p\> For that reason we have to check at how 
many positions from E it j the candidate string 5 differs 
from If this number is at least x^, we are not allowed 
to set a further position p to si[p] and hence we inter- 
dict branching over p. Now, let p be the current posi- 
tion to branch at, and set v := s^p]. If 7Z contains s p>Vi 
we have to adapt the candidate string 5 to s p>v before 
calling the recursion. If 5 conflicts with s p>v , this branch 
of the search tree does not lead to a valid solution. 

Algorithm MismatchCount 

Even after applying our data reduction rules, we have to 
solve the remaining instance using an algorithm such as 
those from [9,12]. In this section we present another 
such procedure, MismatchCount, which is effecient in 
practice, as we will show below. Given binary strings 
s lf ..., s k of length L and a distance threshold d, the Mis- 
matchCount algorithm solves the CLOSEST STRING 
problem as follows: we iterate through all strings s with 
distance at most d to a chosen string s t - without loss 
of generality, we may choose that string to be 5! = 0 ... 
0. This leaves us with a search space of size J2d'=o (d>)' 
We present an enumeration scheme for those s that 
allows effecient testing for the center condition on each 
candidate and that makes it possible to skip large areas 
of the search space based on information gained while 
checking those candidates. 

We enumerate the mismatch positions for d mis- 
matches in Si (and therefore the center string candidates 
5), which is equivalent to generating all binary numbers 
of length m with d bits set to 1, in reverse order (Figure 
1). For every s we check its Hamming distance to the 
remaining strings s 2 , s 3 ,..., s k . Rather than computing 
these distances anew for each candidate, we update the 
Hamming distances derived from the previous candidate 
s\ We do this by increasing or decreasing the distances 
to reflect the changed positions. 

The running time for verifying a center candidate s is 
therefore bounded by 0(g ♦ /<), where g is the number of 
positions changed from s to 5. 

We can determine the overall number of changes per- 
formed during the enumeration of all center candidates 
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Figure 1 Enumeration scheme for all strings s. Enumeration scheme for all strings s with Hamming distance at most 3 to a bit string s-, = 0 ... 
0 of length 5. 



such con- 



as follows: using the enumeration scheme presented, 
each position p in s is changed once to T and once to 
'0' for every configuration of s[l, p - 1] with at most d 

mismatches to sjl, p - 1]. There are ^ 

figurations for each d' = 1, 2,..., d. Summing over all 
possible combinations of p and d\ the number of bit 
changes performed can be bounded by 0(2 L ). Since we 
need to update k Hamming distances for each character 
change in s, the overall worst-case running time of the 
algorithm is bounded by 0(k • 2 L ). 

However, this worst-case analysis refers to the 
exploration of all legal mismatch configurations of 5. As 
already mentioned above, the enumeration scheme 
enables us to skip large areas of the search space. Using 
the maximum Hamming distance d mSLX = max, = 2> ...A^h 
(s, s^) computed in each iteration, we can derive a lower 
bound for the number of positions we have to change in 
5 in order to fulfill the center condition. Therefore, for 
each candidate s taken into consideration, we compute 

ma ^ , where 2 c min is the minimum num- 
ber of positions in 5 we have to change when its succes- 
sor is generated. We can use this condition in two ways. 
First, we cannot change 2 • c min positions in s by chan- 
ging the positions of fewer than c min mismatches. 
Therefore, if all current candidates s with d H (si, s) = d 
are enumerated and we encounter a candidate that 
reveals a c min >d\ we can then generate candidates with 
^h(si> s) = c m i n , without the enumeration of all s with 
d H (s 1} s) g {d, d + 1,..., c min -1}. 

Furthermore, even if c min does not exceed d' for a cur- 
rently observed candidate, we can use that bound to 
skip the enumeration of certain candidates, i.e. continue 
with the enumeration scheme where the c m i n -th mis- 
match from the right is moved next (Figure 2). The enu- 
meration steps in between can be omitted because they 
involve moving fewer than c min mismatch positions and 



we know that we have to change at least 2 • c min posi- 
tions in 5. 

Applying the data reduction to this algorithm is 
straightforward. Recall that Q is the set of positions that 
are not permanent. Then, the reduced instance is 
Si\q, —/SfelQ. When estimating for every candidate s its 
Hamming distance to each remaining string s if we have 
to add the additional amount du{s\j>, S{\-p) to the dis- 
tances of the reduced strings. This is done only once at 
the beginning, since we update the Hamming distances 
during the algorithm. 

Within the other algorithms we use the information 
from Tl and T to cut off branches of the search tree 
that cannot contain a valid solution. However, Mis- 
matchCount uses an iterative search strategy and posi- 
tions are not going to be fixed, but can be inverted 
again. Therefore the use of 7Z and T to fix positions 
interferes with the use of c min to skip the enumeration 
of certain candidates. 

Results and Discussion 

Generating center instances 

To evaluate our algorithms, we use instances generated 
in the context of finding approximate gene clusters. The 
order of genes in genomes can be used to determine the 
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Figure 2 Skip steps. Example case where a c mjn value of 3 allows 
for 4 steps to be skipped. 
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function of unknown genes, as well as the phylogenetic 
history of the organisms. On this global scale, each gene 
is represented by one character (or number), and ortho- 
logous genes are mapped to the same character. Gene 
clusters are sets of genes that occur as single contiguous 
blocks in several genomes. Unfortunately, the require- 
ment of exact occurrences of gene clusters turns out to 
be too strict for the biological application. This leads to 
the development of the center gene cluster model [7], 
which we recapitulate shortly in the following. 

Let Si,..., Sk be the genome strings, where each charac- 
ter represents a gene from the alphabet E. Let Sj[lj ... rj\ 
denote the substring of Sj from position / ; to position r ; . 
Let CS(S) c £ be the set of genes in a string S e £*. 
Finally, let D be the symmetric set distance, D(C, C ) = 
\C \C | + \C \C|. For some distance threshold S, the 
center gene cluster model asks for all gene clusters C <= 
£ of some minimal size such that, for each ; e {1,..., /<}, 
there exist / ; , r y with 

D(C, CS(S j [l j ...r j ]))<8. 

Now, the important point is that the algorithm for 
center gene cluster detection [7], computes candidates 
instead of directly finding center gene clusters. These 
candidates are intervals [l lf r^J,..., [/*,..., r k ] such that the 
sets Cj := CS(Sj[lj..jj\) might allow for some center C <= 
£ with D(C, Cj) < S for all ; = 1,..., k. Our task is to 
check if the resulting center does indeed meet the dis- 
tance threshold. 

We can transform the approximate occurrences Cp for 
; = 1,..., /<, to binary state strings by iterating over all 
genes that appear in at least one approximate occur- 
rence, using T if the approximate occurrence contains 
the gene, and '0' if it does not. The order of genes is not 
important in this transformation, but has to be identical 
for all strings, see also the Preliminaries. Searching for a 
center gene cluster under the symmetric set distance, is 
equivalent to searching for a binary string in the trans- 
formed instance under the Hamming distance. 

The resulting instances are often rather "short", as 
most approximate gene clusters contain only few genes. 
To construct longer and, hence, harder instances for 
our evaluation, we simply concatenate several of these 
short instances (that are blocks of k binary strings) into 
one long instance, being a single block of k binary 
strings. This allows us to evaluate the performance of 
the different methods at the borderline between "tract- 
able" and "intractable" instances. At the same time, we 
argue that the resulting instances are still "biologically 
valid." 

For our evaluation, we use genomes from the NCBI 
Genome database http://www.ncbi.nlm.nih.gov/sites/ 
entrez?db=genome. Grouping of genes into gene families 



is done based on the cluster of orthologous groups cate- 
gorization http://www.ncbi.nlm.nih.gov/COG/. We used 
two protocols to construct the two data sets, where we 
believe the second data set to be closer to the biological 
application that we have in mind. 

For the first data set we used five /-proteobacteria 
(Table 1). Each approximate gene cluster instance con- 
sists of five approximate occurrences, one on each gen- 
ome. An approximate gene cluster instance is converted 
to five binary strings, as described above. We concate- 
nated instances (each consisting of five strings) until the 
desired length L was reached. Additional strings were 
constructed in the same fashion, incorporating further 
cluster occurrences. We created up to 50 instances for 
each combination of k and L with k = 20, 30, 40, 50 and 
L = 250, 300,..., 500. 

We generated the second data set using 43 genomes 
(Table 2). To obtain larger instances, we concatenated 
smaller instances until a pre-defined length L was 
reached. We created 100 instances for each combination 
of k and L with k = 4, 6, 8, 10 and L = 20, 25,..., 40. 
Note that we do not "concatenate instances vertically", 
so the resulting instances are probably closer to the 
"biological truth" than those of the previous protocol. 

To compute d opt we have to increase d stepwise, start- 
ing from the lower bound for d opV given by 

Slower = ^max^jdH^i/Sj). We removed all instances that 

could not be decided for any d with di ower < d < d opt 
within a time limit of 10 minutes by any of the algo- 
rithms, since we cannot determine the right d opt . This 
left us with 664 instances for the first data set and 1957 
instances for the second one. 

Removing trivial columns 

To avoid taking trivial columns into account, we kept 
only the dirty columns, representing the "hard part" of 
the instances. We use V to denote the length of these 
reduced instances. We stress that in the following, all 



Table 1 Genomes from the NCBI Genome database for 
first data set. 



Species name 


Refseq 


Genes 


PC 


Buchnera aphidicola str. APS 


NC_002528 


607 


564 


Escherichia coli str. K-12 substr. MG1655 


NC_000913 


4493 


4149 


Haemophilus in uenzae Rd KW20 


NC_000907 


1789 


1657 


Pasteurella multocida subsp. multocida str. 


NC_002663 


2092 


2015 


Pm70 








Xylella fastidiosa 9a5c 


NC_002488 


2838 


2766 



Five 7 -proteobacteria from the NCBI Genome database, used for detection of 
approximate gene clusters to generate biological instances of the center 
string problem. 'Refseq' is the reference sequence from NCBI Genome 
database, 'PC the number of protein-coding genes. 
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Table 2 Genomes from the NCBI Genome database for 
second data set 



Species name 


Refseq 


Genes 


PC 


Aquifex oeolicus 


NC 


000918 


1580 


1529 


Clostridium acetobutylicum ATCC 824 


NC 


003030 


3843 


3671 


Corynebocterium glutamicum ATCC 13032 


NC 


003450 


3073 


2993 


Deinococcus radiodurons R1 chromosome 1, 


NC 


001263 


2687 


2629 


Deinococcus radiodurons R1 chromosome 2 


NC 


001264 


369 


268 


Fusobocterium nucleatum 


NC 


003454 


2125 


2063 


Listeria innocua Clipl 1 262 


NC 


003212 


3065 


2968 


Mesorhizobium loti 


NC_ 


002678 


6804 


674 


Mycoplasma genitalium 


NC 


000908 


524 


475 


Mycoplasma pneumoniae 


NC_ 


000912 


733 


689 


Mycoplasma pulmonis 


NC 


002771 


815 


782 


Mycobacterium tuberculosis CDC 1551 


NC 


002755 


4293 


4189 


Ralstonia solanacearum, megaplasmid 


NC 


003296 


1684 


1676 


Ralstonia solanacearum 


NC 


003295 


3503 


3437 


Rickettsia conorii str. Malish 7 


NC_ 


.003103 


1414 


1374 


Salmonella typhimurium LT2 


NC 


003197 


4620 


4423 


Staphylococcus aureus subsp. aureus N315 


NC_ 


_002745 


2664 


2583 


Synechocystis sp. PCC 6803 


NC 


00091 1 


3229 


3179 


Thermotoga maritima 


NC 


000853 


1928 


1858 


Ureaplasma urealyticum 


NC 


01 1 374 


695 


646 


Bacillus halodurans C-125 


NC 


002570 


4170 


4065 


Bacillus subtilis 


NC 


014479 


4170 


4062 


Borrelia burgdorferi 


NC 


001318 


890 


851 


Buchnera sp. APS 


NC 


002528 


607 


564 


Campylobacter jejuni 


NC 


008787 


1707 


1653 


Caulobacter crescentus 


NC 


002696 


3819 


3737 


Chlamydia pneumoniae 


NC 


000922 


1122 


1052 


Chlamydia trachomatis 


NC 


0001 1 7 


940 


895 


Escherichia coll 0157:H7 


NC 


002695 


5371 


5229 


Escherichia coll str. K-12 substr. MG1655 


NC_ 


.000913 


4493 


4149 


Haemophilus influenzae Rd 


NC_ 


.000907 


1789 


1657 


Helicobacter pylori 26695 


NC_ 


.000915 


1627 


1573 


Helicobacter pylori str. J 99 


NC_ 


.000921 


1534 


1488 


Lactococcus lactis 


NC_ 


.002662 


z4zj 


ZOZ I 


Xylella fastidiosa 


NC_ 


.002488 


2838 


2766 


Neisseria meningitidis serogroup B str. MC58 


NC_ 


.0031 12 


2225 


2063 


Pasteurella multocida PM70 


Mr 


.UUZDDj 


zuyz 


zU I j 


Pseudomonas aeruginosa PA01 


NC 


.002516 


5669 


5566 


Rickettsia prowazekii str. Madrid E 


NC 


.000963 


888 


835 


Streptococcus pneumoniae 


NC 


.012467 


2254 


2073 


Streptococcus pyogenes str. SF370 serotype 
M1 


NC 


.002737 


1810 


1696 


Treponema pallidum 


NC 


.000919 


1095 


1036 


Vibrio cholerae chromosome 1 


NC 


.012668 


2897 


2768 


Vibrio cholerae chromosome 2 


NC_ 


.012667 


1013 


1004 


Neisseria meningitidis serogroup A str. Z2491 


NC 


.003116 


2065 


1909 


Mycobacterium leprae str. TN 


NC 


.002677 


2770 


1605 



Genomes from the NCBI Genome database used for detection of approximate 
gene clusters to generate biological instances of the center string problem. 
'Refseq' is the reference sequence from NCBI Genome database, 'PC the 
number of protein-coding genes. 



computations and evaluations are performed on these 
reduced instances. The amount of reduction shows the 
difference between the two data sets. While in the first 
data set we only kept between 35.7% and 56.5% dirty 
columns, the instances from the second data set are 
much harder, containing on average between 89.0% and 
97.0% dirty columns, depending on the number of 
strings. The number of dirty columns increases with the 
number of strings (Table 3). 

We concentrate on the computation of center strings 
for d = d opt and d = d opt - 1, since these are the compu- 
tationally hard instances (Figure 4). For the parameter- 
ized algorithms, worst-case running times grow 
exponentially in d, and running times of algorithms are 
also dominated by these cases in practice. 

Excluding unsolvable instances by preprocessing 

Our preprocessing allows us to exclude unsolvable 
instances more efficiently than the computation of the 
naive kernel, when d is too small for a center string to 
exist. This is of particular interest as, here, our algo- 
rithms have to scan the complete solution space to 
ensure that no solution exists. Recall that, during the 
computation of the naive kernel, instances with more 

than kd dirty columns or d < ^maxi t jdii(si,Sj) are 

rejected, since the instance cannot have a solution for 
this choice of d. The percentage of instances excluded 
by preprocessing for d = d opt - 1 ranges between 50 and 
100 (Table 4). Our improved preprocessing always filters 
out more instances than does the naive kernel. For dif- 
ferent /c, we can exclude between 96.2% and 100% of 
instances, that have not been filtered by the naive kernel 
for the first data set and, for the second data set, we can 
exclude between 87.7% and 94.3% of instances. Recall 
that the instances we removed (261 for the first data set 
and 43 for the second one) have not been filtered by 
preprocessing for their lower bound d. Since we cannot 
determine whether this lower bound is the real d opt or 
d opt - 1, these instances are not taken into account, 
leading to the high percentages in the first data set. 

Solving trivial positions by preprocessing 

The second advantage of our method is the computa- 
tion of positions that can be trivially solved during pre- 
processing. The percentage of fixed positions is high for 

Table 3 Average percentage of dirty columns depending 
on k 

data set first (5 species) second (43 species) 

number of sequences k 20 30 40 50 4 6 8 10 

dirty columns 35.7 43.9 50.4 56.5 89.0 93.8 95.3 97.0 
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Figure 4 Average running times of all instances. Average running times of all instances for first (top) and second (bottom) data set. Running times 
are depicted in dependency on varying d around d opt . Algorithm Gramm is shown separately for the second data set due to the long running times. 



the important case d = d opt . In fact, for the first data set 
an average of 56.2% of the positions were fixed for these 
instances during preprocessing, and 31.7% for the sec- 
ond data set. Recall that MismatchCount and the algo- 
rithm of Ma and Sun work on these reduced instances. 
The number of solved positions depends on the d opt IV 
ratio of the instance, since at least V - 2d opt are fixed if 
a string pair with distance 2d opt exists, and decreases 
with increasing d opt IV (Figure 3). If we use d opt IV as a 
measure for the hardness of the instance, the difference 
between the two data sets is obvious. 

For the first data set we further observe that there is 
no "twilight zone" of fixed positions. In 80.9% of the 



Table 4 Percentage of instances excluded by 
preprocessing, for d = d opt - 1 . 

data set first (5 species) second (43 species) 

number of sequences k 20 30 40 50 4 6 8 10 



naive kernel (%) 


81.3 


82.2 


85.0 


86.1 


92.9 


68.4 


56.6 


50.0 


our preprocessing, from 


100 


100 


96.2 


100 


94.3 


89.6 


87.7 


90.6 


remaining (%) 


















total excluded instances 


100 


100 


99.4 


100 


99.6 


96.7 


94.7 


95.3 


(%) 



















instances, more than 40% of positions were fixed; in 
15.4%, the data reduction did not fix any positions, and 
in fewer than 3.8% of the instances, we observed a fixing 
of up to 40% of positions. 

Running times 

We have implemented the algorithms of Gramm et al. 
[12], Ma and Sun [9], and the MismatchCount algo- 
rithm, referred to as "Gramm", "MaSun" and "Mis- 
matchCount", respectively. These algorithms do not 
include any preprocessing beyond the naive kernel. 
Name suffix "Pre" indicates that preprocessing and algo- 
rithm engineering are enabled. For the MismatchCount 
algorithm, only the information from V is used. 

We implemented all algorithms in Java and compiled 
them with the Sun Java Standard Edition compiler version 
1.6. We did all computations on a quad-core 2.2 GHz 
AMD Opteron processor with 5 GB of main memory 
under the Solaris 10 operating system. The running times 
presented are the core running times of the algorithms 
and do not include I/O or removal of clean columns. We 
restricted running time to 10 minutes per instance. 

We first show that running times of all algorithms are 
really dominated by the cases d = d opt - 1 and d = d opt 
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the instances for the first data set (full dots) and the second data set (empty dots). Dots represent individual instances, the solid line is 
percentage for intervals of width 0.05. 



(Figure 4). It is clear that it is sufficient to concentrate 
on the two cases d = d opt - 1 and d = d opt . The short 
running times for d opt - 1 for the first data set are again 
due to the removal of instances for which the lower 
bound could not be decided. Note that if there is no 
string pair with distance 2d opt or 2d opt - 1, we cannot 
avoid calling the algorithm with d 

— d op t - 1 to ensure 

that d opt is truly optimal. 

To show how running times depend on d opV we 
pooled instances with respect to the optimum center 
distance d opt . For d = d opt - 1 we excluded all instances 
where d <V Ik after removing clean columns, or 

d < ^-maXi j^H^i^j) as these obviously have no solu- 
tion, leaving us with 644 instances for the second data 
set, while the 108 remaining instances for the first data 
set are not enough to analyze. Even if instances are not 
rejected by preprocessing, the algorithms tend to reject 
instances more quickly if the preprocessing information 
is used. Different percentages were rejected by the algo- 
rithms within different sets of time limits for the second 
data set (Table 5). 

Using data reduction and information gained during 
preprocessing reduces the running times of the algo- 
rithms for both d = d opt - 1 and d = d opt in all cases 
(Figure 5). On the first data set, MismatchCount using 
the preprocessing information outperforms the other 
algorithms, while MaSunPre is best on the second data 
set, especially where d = d opt . The improvement of 



MismatchCount is least significant since the information 
from Tl and T cannot be used. 

Conclusions 

We have presented improved preprocessing for the 
CENTER STRING problem. This is based on the 
observation that, for strings with an optimal center at 
distance d } there are usually many pairs of strings with 
distance close or equal to 2d. Our data reduction 
allows us to reject more instances that do not have a 
valid center string, and to draw conclusions about cer- 
tain positions of a center string. We show how this 
information can be used in the search tree algorithms 
of Gramm et al., and Ma and Sun. We have also pre- 
sented the MismatchCount algorithm for binary 
alphabets. 

In our experimental evaluation, we showed that, with- 
out preprocessing, the MismatchCount algorithm has 
better running times than the other two algorithms. 
Furthermore, our data reduction is very efficient and 
algorithms using this information outperform the origi- 
nal ones, with the overall best performance shown by 
MismatchCount on the first data set and the algorithm 
of Ma and Sun in combination with our preprocessing 
on the second. Our data reduction is particularly helpful 
for tackling the case d = d opt - 1, as we can exclude 
more instances. 

For the Levenshtein distance and weighted edit dis- 
tances, the CENTER STRING problem problem is W 



Table 5 Percentage of instances excluded by the algorithms within different time limits, for d = d opt - 1. 





MCPre 


MC 


MaSunPre 


MaSun 


GrammPre 


Gramm 


time limit 10 min (%) 


100 


100 


100 


100 


98.5 


9.0 


time limit 1 min (%) 


98.5 


98.5 


100 


98.5 


83.6 


4.5 


time limit 1 sec (%) 


34.3 


23.9 


89.6 


17.9 


34.3 


1.5 



Only the 67 instances of the second data set not rejected by the preprocessing were taken into account. 'MC denotes the MismatchCount algorithm. 
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first data set (top), and d = d opt - 1 (middle) as well as d = d opt (bottom) for the second data set. Note the logarithmic scale for running times. 



[l]-hard regarding the number of input strings. To the 
best of our knowledge, it is an open problem if these 
problems are W[l]-hard regarding the distance para- 
meter, too. In this case, our parameterized methods 
would be not applicable for these distances. 
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